LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Authors

  • Baotian Hu
  • Qingcai Chen
  • Fangze Zhu
Abstract

Automatic text summarization is widely regarded as a highly difficult problem, partly because of the lack of large text summarization datasets. Because constructing large-scale summaries for full texts is very challenging, in this paper we introduce a large-scale Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which will be released to the public soon. The corpus consists of over 2 million real Chinese short texts, each with a short summary given by its author. We also manually tagged the relevance of 10,666 short summaries to their corresponding short texts. Based on the corpus, we apply a recurrent neural network to summary generation and achieve promising results, which not only show the usefulness of the proposed corpus for short text summarization research but also provide a baseline for further research on this topic.

Similar resources

A Hybrid Word-Character Model for Abstractive Summarization

Abstractive summarization is a popular research topic nowadays. Because of differences in language properties, Chinese summarization has also attracted considerable attention. Most studies use character-based rather than word-based representations to avoid the errors introduced by word segmentation and the OOV problem. However, we believe that word-based representation can capture the semantics of the article...

A Semantic Relevance Based Neural Network for Text Summarization and Text Simplification

Abstractive text summarization has achieved successful performance thanks to the sequence-to-sequence model (Sutskever, Vinyals, and Le 2014) and the attention mechanism (Bahdanau, Cho, and Bengio 2014). Rush, Chopra, and Weston (2015) first used an attention-based encoder to compress texts and a neural network language decoder to generate summaries. Following this work, a recurrent encoder was introduced to...

Text Summarization Using Cuckoo Search Optimization Algorithm

Today, with the rapid growth of the World Wide Web and the proliferation of Internet sites and online text resources, text summarization has attracted the attention of many researchers. Extractive text summarization is an important summarization method that consists of selecting the most representative sentences from the input document. When we face documents with large data volumes, the extr...

Overview of the NLPCC 2015 Shared Task: Weibo-Oriented Chinese News Summarization

The Weibo-oriented Chinese news summarization task aims to automatically generate a short summary for a given Chinese news article, and the short summary is used for news release and propagation on Sina Weibo. The length of the short summary is less than 140 Chinese characters. The task can be considered a special case of single document summarization. In this paper, we will introduce the evalu...

Automatic Generation of Chinese Short Product Titles for Mobile Display

This paper studies the problem of automatically extracting a short title from a manually written longer description of E-commerce products for display on mobile devices. It is a new extractive summarization problem on short text inputs, for which we propose a feature-enriched network model, combining three different categories of features in parallel. Experimental results show that our framewor...

Publication date: 2015